-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix (22477) dtype=str converts NaN to 'n' #22564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix (22477) dtype=str converts NaN to 'n' #22564
Conversation
Codecov Report
@@ Coverage Diff @@
## master #22564 +/- ##
==========================================
+ Coverage 92.28% 92.28% +<.01%
==========================================
Files 161 161
Lines 51457 51461 +4
==========================================
+ Hits 47489 47493 +4
Misses 3968 3968
Continue to review full report at Codecov.
|
e94bebe
to
e4eb011
Compare
Just saw that this failed. I am travelling tomorrow but will pick it up asap 👍 |
@Nikoleta-v3 : Awesome! Also, don't forget the |
pandas/core/dtypes/cast.py
Outdated
dtype = np.float64 | ||
subarr = np.empty(length, dtype=dtype) | ||
dtype = np.dtype('float64') | ||
if isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dtype.kind in ("U", "S")
can be replaced with is_string_dtype(dtype)
.
is_string_dtype
can be found in pandas/core/dtypes/common
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will replace asap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can also be an elif
pandas/core/dtypes/cast.py
Outdated
dtype = np.float64 | ||
subarr = np.empty(length, dtype=dtype) | ||
dtype = np.dtype('float64') | ||
if isinstance(dtype, np.dtype) and is_string_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the isinstance
check for np.dtype needed here?
if isinstance(dtype, np.dtype) and is_string_dtype(dtype): | ||
subarr = np.empty(length, dtype=object) | ||
if not isna(value): | ||
value = to_str(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer to have all of the dtype checking in the if/elif/else and then construct the subarr after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, dtype
needs to be overwritten with object
because we don't want actual string dtypes (which is easy to do, just noting :-))
pandas/core/dtypes/common.py
Outdated
@@ -367,6 +367,8 @@ def is_datetime64_dtype(arr_or_dtype): | |||
tipo = _get_dtype_type(arr_or_dtype) | |||
except TypeError: | |||
return False | |||
except UnicodeEncodeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what hits this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an array of a Unicode element. The problem is with numpy's dtype
I opened an issue a few days ago: numpy/numpy#11860
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can catch the TypeError and UnicodeDecodeError on a single line (unless we maybe want to add a specific comment pointing to the numpy issue explaining why)
@@ -137,6 +137,26 @@ def test_constructor_no_data_index_order(self): | |||
result = pd.Series(index=['b', 'a', 'c']) | |||
assert result.index.tolist() == ['b', 'a', 'c'] | |||
|
|||
def test_constructor_no_data_string_type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls parametrize these tests
# GH 22477 | ||
result = pd.Series(index=[1], dtype=str) | ||
assert result.isna().all() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the value using iloc instead here which returns a scalar
Hey everyone! Thank you very much for your comments. I haven't forgotten about this, I am currently very busy preparing for PyCon UK. I will make sure to fix the failures and address your comments as soon as possible. |
can you rebase and fixup |
More specifically the cases that seem to have an issue are when: - the series in empty - it's a single element series
Add a check so if the dtype is str is will create an empty array type object and then pass the values. Add test for an empty series. To chech that it fills the series with NaN and not with 'n'. Also add a test for cases that no string values are given.
To allow the developers to remember why the specific test was added
This is currently failing.
is_datetime64_dtype is trying to check the type of unicodes but numpy does not support unicode and this line breaks. Add except error and return false Test for unicode still fails for python 2
This was breaking for python 2. The fix is to use pandas text_type to return string type
parametrize tests and use iloc to check value
76bbb9c
to
ee854d7
Compare
Hello @Nikoleta-v3! Thanks for updating the PR.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a whatsnew note as well. In the Missing section of bug fixes
pandas/core/dtypes/cast.py
Outdated
dtype = np.float64 | ||
subarr = np.empty(length, dtype=dtype) | ||
dtype = np.dtype('float64') | ||
if isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can also be an elif
result = pd.Series(index=[1], dtype=str) | ||
assert np.isnan(result.iloc[0]) | ||
|
||
@pytest.mark.parametrize('item', ['13']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to parameterize this test (only 1 case), and you need to change the name as the next test overwrites it.
@@ -807,7 +807,7 @@ def test_constructor_corner_shape(self): | |||
|
|||
@pytest.mark.parametrize("data, index, columns, dtype, expected", [ | |||
(None, lrange(10), ['a', 'b'], object, np.object_), | |||
(None, None, ['a', 'b'], 'int64', np.dtype('int64')), | |||
(None, None, ['a', 'b'], 'int64', np.dtype('float64')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @TomAugspurger this makes it pass, but this looks slightly suspect, e.g. in master
In [4]: pd.DataFrame(columns=list('ab'),dtype=int).dtypes
Out[4]:
a int64
b int64
dtype: object
though we coerce almost always on assignment
In [9]: df = pd.DataFrame(columns=list('ab'),dtype=int)
In [10]: df.loc[0] = 5
In [11]: df
Out[11]:
a b
0 5 5
In [12]: df.dtypes
Out[12]:
a int64
b int64
dtype: object
In [13]: df = pd.DataFrame(columns=list('ab'),dtype=int)
In [14]: df.loc[0, 'a'] = 5
In [15]: df
Out[15]:
a b
0 5.0 NaN
In [16]: df.dtypes
Out[16]:
a float64
b float64
dtype: object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the DataFrame coercion might be wrong here
i think this is fixed now. |
Thank you for the commits @jreback 👍 |
pandas/core/dtypes/cast.py
Outdated
if not isna(value): | ||
value = to_str(value) | ||
else: | ||
subarr = np.empty(length, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback by putting this here in the else
statement, there is no creation of subarr
if you are in the first if case of if length and is_integer_dtype(dtype) and isna(value)
(which seems to indicate this is not covered by the tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and do u gave have a case that doesn’t work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and do u gave have a case that doesn’t work?
ok should be fixed up. |
@jorisvandenbossche if you have any other issues. |
Thanks @Nikoleta-v3 and @jreback !
I suppose that integer line is still not covered in the tests, but yeah, that's not related to what this PR was trying to fix. |
…fixed * upstream/master: DOC: Removing rpy2 dependencies, and converting examples using it to regular code blocks (pandas-dev#23737) BUG: Fix dtype=str converts NaN to 'n' (pandas-dev#22564) DOC: update pandas.core.resample.Resampler.nearest docstring (pandas-dev#20381) REF/TST: Add more pytest idiom to parsers tests (pandas-dev#23810) Added support for Fraction and Number (PEP 3141) to pandas.api.types.is_scalar (pandas-dev#22952) DOC: Updating to_timedelta docstring (pandas-dev#23259)
More specifically the cases that seem to have an issue are when: - the series in empty - it's a single element series * Closes pandas-dev#22477
More specifically the cases that seem to have an issue are when: - the series in empty - it's a single element series * Closes pandas-dev#22477
git diff upstream/master -u -- "*.py" | flake8 --diff
Now:
This is implemented by adding a check so if the
dtype
isstr
is will create an empty array type object and then pass the values. Two tests have been implemented: